97 research outputs found

    Fully-Coupled Two-Stream Spatiotemporal Networks for Extremely Low Resolution Action Recognition

    A major emerging challenge is how to protect people's privacy as cameras and computer vision are increasingly integrated into our daily lives, including in smart devices inside homes. A potential solution is to capture and record just the minimum amount of information needed to perform a task of interest. In this paper, we propose a fully-coupled two-stream spatiotemporal architecture for reliable human action recognition on extremely low resolution (e.g., 12x16 pixel) videos. We provide an efficient method to extract spatial and temporal features and to aggregate them into a robust feature representation for an entire action video sequence. We also consider how to incorporate high resolution videos during training in order to build better low resolution action recognition models. We evaluate on two publicly-available datasets, showing significant improvements over the state-of-the-art. Comment: 9 pages, 5 figures, published in WACV 201
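    The abstract describes a two-stream (spatial plus temporal) network whose per-frame features are coupled and then aggregated over the whole clip. The sketch below is a minimal illustration of that idea for 12x16 inputs, not the paper's architecture: the layer sizes, the concatenation-based coupling, the stacked-optical-flow temporal input, and the mean-pooling aggregation are all assumptions.

```python
# Minimal sketch of a two-stream network for extremely low resolution
# (12x16) action clips. Layer sizes, the concatenation-based coupling,
# and the mean-pooling aggregation are illustrative assumptions, not the
# architecture from the paper.
import torch
import torch.nn as nn


class TinyStream(nn.Module):
    """Small CNN backbone sized for 12x16 inputs."""
    def __init__(self, in_channels):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 32, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                              # 12x16 -> 6x8
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                              # 6x8 -> 3x4
        )
        self.fc = nn.Linear(64 * 3 * 4, 128)

    def forward(self, x):
        return self.fc(self.features(x).flatten(1))


class TwoStreamActionNet(nn.Module):
    def __init__(self, num_classes, flow_stack=10):
        super().__init__()
        self.spatial = TinyStream(in_channels=3)                  # RGB frame
        self.temporal = TinyStream(in_channels=2 * flow_stack)    # stacked optical flow
        self.classifier = nn.Linear(256, num_classes)

    def forward(self, rgb, flow):
        # rgb:  (batch, time, 3, 12, 16)
        # flow: (batch, time, 2*flow_stack, 12, 16)
        b, t = rgb.shape[:2]
        s = self.spatial(rgb.flatten(0, 1)).view(b, t, -1)
        m = self.temporal(flow.flatten(0, 1)).view(b, t, -1)
        # Couple the streams per time step, then aggregate over the clip.
        video_feat = torch.cat([s, m], dim=-1).mean(dim=1)
        return self.classifier(video_feat)
```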

    Egocentric Vision-based Future Vehicle Localization for Intelligent Driving Assistance Systems

    Predicting the future location of vehicles is essential for safety-critical applications such as advanced driver assistance systems (ADAS) and autonomous driving. This paper introduces a novel approach to simultaneously predict both the location and scale of target vehicles in the first-person (egocentric) view of an ego-vehicle. We present a multi-stream recurrent neural network (RNN) encoder-decoder model that separately captures both object location and scale and pixel-level observations for future vehicle localization. We show that incorporating dense optical flow improves prediction results significantly since it captures information about motion as well as appearance change. We also find that explicitly modeling future motion of the ego-vehicle improves the prediction accuracy, which could be especially beneficial in intelligent and automated vehicles that have motion planning capability. To evaluate the performance of our approach, we present a new dataset of first-person videos collected from a variety of scenarios at road intersections, which are particularly challenging moments for prediction because vehicle trajectories are diverse and dynamic. Comment: To appear at ICRA 201
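    As a rough illustration of the multi-stream encoder-decoder idea in this abstract, the sketch below encodes a past bounding-box trajectory and pooled optical-flow features with separate recurrent encoders, fuses them, and unrolls a decoder over future steps. The GRU cells, hidden sizes, 5x5 flow pooling, and residual box prediction are illustrative assumptions rather than the paper's model; planned ego-motion could be concatenated to the fused state in the same way.

```python
# Minimal sketch of a multi-stream RNN encoder-decoder that fuses a past
# bounding-box trajectory with pooled optical-flow features to predict
# future boxes in the egocentric view. All sizes and the residual box
# decoding are illustrative assumptions.
import torch
import torch.nn as nn


class FutureBoxPredictor(nn.Module):
    def __init__(self, hidden=128, future_steps=10):
        super().__init__()
        self.future_steps = future_steps
        self.box_enc = nn.GRU(4, hidden, batch_first=True)            # (cx, cy, w, h) per step
        self.flow_enc = nn.GRU(2 * 5 * 5, hidden, batch_first=True)   # 5x5 pooled flow inside the box
        self.decoder = nn.GRUCell(hidden * 2, hidden * 2)
        self.head = nn.Linear(hidden * 2, 4)                          # per-step box offset

    def forward(self, past_boxes, past_flow):
        # past_boxes: (batch, T, 4); past_flow: (batch, T, 2, 5, 5)
        _, h_box = self.box_enc(past_boxes)
        _, h_flow = self.flow_enc(past_flow.flatten(2))
        state = torch.cat([h_box[-1], h_flow[-1]], dim=-1)
        inp, last, out = state, past_boxes[:, -1], []
        for _ in range(self.future_steps):
            state = self.decoder(inp, state)
            last = last + self.head(state)     # predict residual motion of the box
            out.append(last)
            inp = state
        return torch.stack(out, dim=1)         # (batch, future_steps, 4)
```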

    Identifying First-person Camera Wearers in Third-person Videos

    We consider scenarios in which we wish to perform joint scene understanding, object tracking, activity recognition, and other tasks in environments in which multiple people are wearing body-worn cameras while a third-person static camera also captures the scene. To do this, we need to establish person-level correspondences across first- and third-person videos, which is challenging because the camera wearer is not visible from his/her own egocentric video, preventing the use of direct feature matching. In this paper, we propose a new semi-Siamese Convolutional Neural Network architecture to address this novel challenge. We formulate the problem as learning a joint embedding space for first- and third-person videos that considers both spatial- and motion-domain cues. A new triplet loss function is designed to minimize the distance between correct first- and third-person matches while maximizing the distance between incorrect ones. This end-to-end approach performs significantly better than several baselines, in part by learning the first- and third-person features optimized for matching jointly with the distance measure itself.
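    The triplet objective described above can be sketched concretely: embed first- and third-person features into a shared space and require the correct pair to be closer than an incorrect pair by a margin. The projection heads, feature dimensions, and margin value below are assumptions; the paper's semi-Siamese network shares some layers between the two branches, which this sketch only approximates by giving each view its own head.

```python
# Minimal sketch of a joint first-/third-person embedding trained with a
# triplet margin loss. Encoder sizes and the margin are illustrative
# assumptions, not the paper's exact architecture.
import torch
import torch.nn as nn
import torch.nn.functional as F


class JointEmbedding(nn.Module):
    def __init__(self, ego_dim=512, third_dim=512, embed_dim=128):
        super().__init__()
        self.ego_head = nn.Sequential(nn.Linear(ego_dim, 256), nn.ReLU(),
                                      nn.Linear(256, embed_dim))
        self.third_head = nn.Sequential(nn.Linear(third_dim, 256), nn.ReLU(),
                                        nn.Linear(256, embed_dim))

    def forward(self, ego_feat, third_feat):
        # L2-normalize so distances are comparable across the two views.
        return (F.normalize(self.ego_head(ego_feat), dim=-1),
                F.normalize(self.third_head(third_feat), dim=-1))


def triplet_matching_loss(ego, third_pos, third_neg, margin=0.2):
    """Pull the correct first-/third-person pair together and push the
    incorrect pair at least `margin` further away."""
    d_pos = (ego - third_pos).pow(2).sum(dim=-1)
    d_neg = (ego - third_neg).pow(2).sum(dim=-1)
    return F.relu(d_pos - d_neg + margin).mean()
```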